REALEC learner treebank: annotation principles and evaluation of automatic parsing

نویسندگان

  • Olga Lyashevskaya
  • Irina Panteleeva
چکیده

The paper presents a Universal Dependencies (UD) annotation scheme for a learner English corpus. The REALEC dataset consists of essays written in English by Russian-speaking university students in the course of general English. The original corpus is manually annotated for learners’ errors and gives information on the error span, error type, and the possible correction of the mistake provided by experts. The syntactic dependency annotation adds more value to learner corpora since it makes it possible to explore the interaction of syntax and different types of errors. Also, it helps to assess the syntactic complexity of learners’ texts. While adjusting existing dependency parsing tools, one has to take into account to what extent students’ mistakes provoke errors in the parser output. The ungrammatical and stylistically inappropriate utterances may challenge parsers’ algorithms trained on grammatically appropriate academic texts. In our experiments, we compared the output of the dependency parser Ud-pipe (trained on ud-english 2.0) with the results of manual parsing, placing a particular focus on parses of ungrammatical English clauses. We show how mistakes made by students influence the work of the parser. Overall, Ud-pipe performed reasonably well (UAS 92.9, LAS 91.7). We provide the analysis of several cases of erroneous parsing which are due to the incorrect detection of a head, on the one hand, and with the wrong choice of the relation type, on the other hand. We propose some solutions which could improve the automatic output and thus make the syntax-based learner corpus research and assessment of the syntactic complexity more reliable. The REALEC treebank is freely available under the CC BY-SA 3.0 licence.1

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

The Construction of A Chinese Shallow Treebank

This paper presents the construction of a manually annotated Chinese shallow Treebank, named PolyU Treebank. Different from traditional Chinese Treebank based on full parsing, the PolyU Treebank is based on shallow parsing in which only partial syntactical structures are annotated. This Treebank can be used to support shallow parser training, testing and other natural language applications. Phr...

متن کامل

The Norwegian Dependency Treebank

The Norwegian Dependency Treebank is a new syntactic treebank for Norwegian Bokmål and Nynorsk with manual syntactic and morphological annotation, developed at the National Library of Norway in collaboration with the University of Oslo. It is the first publically available treebank for Norwegian. This paper presents the core principles behind the syntactic annotation and how these principles we...

متن کامل

Automatic Adaptation of Annotations

Manually annotated corpora are indispensable resources, yet for many annotation tasks, such as the creation of treebanks, there exist multiple corpora with different and incompatible annotation guidelines. This leads to an inefficient use of human expertise, but it could be remedied by integrating knowledge across corpora with different annotation guidelines. In this article we describe the pro...

متن کامل

Universal Dependencies for Learner English

We introduce the Treebank of Learner English (TLE), the first publicly available syntactic treebank for English as a Second Language (ESL). The TLE provides manually annotated POS tags and Universal Dependency (UD) trees for 5,124 sentences from the Cambridge First Certificate in English (FCE) corpus. The UD annotations are tied to a pre-existing error annotation of the FCE, whereby full syntac...

متن کامل

تصحیح خودکار خطا در درخت بانک نحوی با استفاده از یادگیری ماشینی انتقال محور

The Treebank is one of the most useful resources for supervised or semi-supervised learning in many NLP tasks such as speech recognition, spoken language systems, parsing and machine translation. Treebank can be developded in different ways that could be, generally, categorized in manually and statistical approaches. While the resulted Treebank in each of these methods has the annotation error,...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2018